Video Based Recognition of Hand Gestures by Neural Networks for the Control of Sound and Music
Abstract
In recent years, video-based analysis of human motion has gained increased interest, largely due to the ongoing rapid development of computer and camera hardware, such as increased CPU power, fast and modular interfaces, and high-quality image digitisation. The development of powerful approaches for the analysis of visual data from video sources plays a similarly important role. In computer music this development is reflected by a number of applications that analyse video and image data for the gestural control of music and sound, such as Eyesweb, Jitter and CV ([1], [2], [3]). The recognition and interpretation of hand movements is of great interest in both music and software engineering ([4], [5], [6]). This demo presents an approach to controlling music and sound parameters through hand gestures, which are recognised by an artificial neural network (ANN). The recognition network was trained with appearance-based features extracted from the image sequences of a video camera.

1. A SET OF CYCLIC HAND GESTURES

Previous experiments showed that hand gestures may be combined into cyclic gestures, such as waving the hand or pointing to the left and to the right with the index finger [7].

Gesture                Description                                 Short names of main states
Index up/down          Index finger moves up and down              indUp, indDo
Index left/right       Index finger moves left and right           indLe, indRi
Cut up/down            Flat hand moves up and down                 cutUp, cutDo
Cut left/right         Flat hand moves left and right              cutLe, cutRi
Horizontal open/close  Hand with horizontal back opens and closes  horOp, horCl
Vertical open/close    Hand with vertical back opens and closes    verOp, verCl
Croco open/close       Hand with thumb opens and closes            corOp, corCl
Swing open/close       Hand turns and opens and turns and closes   swiOp, swiCl

Table 1: A set of cyclic gestures of the left hand

For this, the motion of a gesture is grouped into at least two main states that are performed repetitively.
The whole gesture may then be seen as a progression through a cyclic state model, so that the gesture is viewed not as an isolated event but in the context of the gesture and its related motions.

2. VARIATION OF GESTURE INSTANCES

Each gesture was recorded at 3 lower positions and 2 upper positions of the gestural space of the hand and arm, to obtain data reflecting the variance of the hand articulation at differing locations. All 5 recording instances were intended to lie in a plane parallel to the front of the camera. Blended hand positions for the static states of four cyclic gesture types are shown in Figure 1 to Figure 4.

Figure 1: Horizontal open/close
Figure 2: Vertical open/close
Figure 3: Croco open/close
Figure 4: Swing open/close

3. TIME DELAY NEURAL NETWORKS

Time Delay Neural Networks (TDNN) are feed-forward networks that learn time series through a series of data windows (delays) shifted in time over the data series. An exemplary TDNN would consist of 4x4 input units and 4 input delay frames. To apply such a TDNN to image features, larger input frames were used, i.e. 1024 or 256 input units, with a hidden layer of 50 units ([8]). The number of output units was in the range of 24 to 37, similar to the example shown. In our approach each output unit was associated with a certain state of the training patterns for the network.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. NIME08, June 5-7, 2008, Genova, Italy. Copyright remains with the author(s).

4. GESTURAL CONTROL OF SOUND

The system aims to control a sound generation process through discrete bindings of gestures to sound parameters. It uses gestures of the left hand, which are recognised by the video analysis. Two identical sound generation processes for live sampling and sound modification are realised in a Max/MSP patch. The position space of the hand is divided by a dedicated object (Gitter) into 9 concentric fields (Figure 5).

Figure 5: Division of the gesture plane into concentric fields

The binding of parameter groups to Gitter fields aims to place the more important and more frequently used parameters in the centre of the position space, and the less frequently used parameter groups around the central area.

Remote to body                Centre                     Close to body
Effect distortion (low)       Diffusion/panning (mid)    Reverb (low)
Volume (low)                  Selection of sound (mid)   Recording (mid)
Effect ring-modulation (low)  Filter (high)              Granulation (high)

Table 2: Binding of sound parameter groups

In Table 2 the estimated complexity and number of parameters is given in brackets. Each field of the gesture coordinate space expands into a list of selectable choices. In the stage setup (Figure 7) these choices are mainly implemented as settings for the parameter category associated with the field, each representing a morphing state of the sound-processing patch (Figure 6).

Figure 6: Display of sound actions (reverb) bound to the current field of the hand position

5. RESULTS

The demo system is a prototype for the use of gesture recognition with artificial neural networks integrated in a sound generation context. The degree of recognition precision required varies between different performance paradigms.
It may range from an aleatoric approach, where a low recognition rate of 75% to 85% is sufficient, to a strict binding of the gestures to the complex control of an elaborate instrument, where a single misrecognition will disturb the whole musical concept or at least be perceived as a hindering error. For both extremes, the aleatoric and the strict approach, the binding of body gestures to musical actions has to be considered thoroughly. A similar situation holds for the required number of recognisable gestures, which depends on the musical intention and the role it assigns to gesture recognition. Two or three gestures may be enough to play a central role in a piece. Complex control requires a larger number of gestures, e.g. more than the 16 gesture states of the hand used for training the neural network of the demo system.

Figure 7: Stage setup

The goal of the applied setup was not to use as many mappings as possible from the gesture recognition or the visual analysis, but to focus on the case in which visual musical control is achieved by an automatic recognition process. Gesture recognition may be only one part of the dramaturgy of a piece, but it reflects the development of information technology, which is approaching the human mind and the human body.
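The time-delay windowing described in Section 3 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each video frame has already been reduced to a flat list of feature values, reads the "4x4 input units" of the exemplary TDNN as 16 features per frame, and uses the hypothetical helper name delay_windows. It only shows how a window of delay frames shifts over the series to form the network inputs; the network itself is omitted.

```python
from typing import List, Sequence

Frame = Sequence[float]


def delay_windows(frames: Sequence[Frame], n_delays: int) -> List[List[float]]:
    """Slide a window of n_delays consecutive frames over the series.

    Each window is flattened into one input vector, so a feed-forward
    network fed with these vectors sees n_delays time steps at once.
    """
    if n_delays < 1:
        raise ValueError("n_delays must be at least 1")
    windows = []
    for start in range(len(frames) - n_delays + 1):
        window: List[float] = []
        for frame in frames[start:start + n_delays]:
            window.extend(frame)
        windows.append(window)
    return windows


# Toy series (assumption): 6 frames with 16 feature values each.
frames = [[float(t)] * 16 for t in range(6)]
windows = delay_windows(frames, n_delays=4)
print(len(windows), len(windows[0]))  # 3 64
```

With 1024 or 256 features per frame, as used for the image features, the same windowing would simply produce longer input vectors.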
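The binding of hand position to parameter groups described in Section 4 can be sketched in a similar way. The sketch below is an illustration under several assumptions, not a reproduction of the Gitter object in the Max/MSP patch: it treats the 9 fields as a plain 3x3 grid, assumes hand coordinates x and y normalised to the range 0 to 1, and takes the row and column order directly from Table 2. The names FIELDS, field_index and parameter_group are hypothetical.

```python
# Parameter groups as listed in Table 2. Mapping columns to x and rows to y
# is an assumption made for this sketch.
FIELDS = [
    ["Effect distortion", "Diffusion/panning", "Reverb"],
    ["Volume", "Selection of sound", "Recording"],
    ["Effect ring-modulation", "Filter", "Granulation"],
]


def field_index(value: float, n_fields: int = 3) -> int:
    """Map a normalised coordinate in [0, 1] to a field index 0..n_fields-1."""
    clamped = min(max(value, 0.0), 1.0)
    # The upper edge 1.0 would give n_fields, so cap it at the last field.
    return min(int(clamped * n_fields), n_fields - 1)


def parameter_group(x: float, y: float) -> str:
    """Return the parameter group bound to the field containing (x, y)."""
    return FIELDS[field_index(y)][field_index(x)]


print(parameter_group(0.5, 0.5))  # Selection of sound
```

A discrete lookup of this kind matches the discrete bindings mentioned in the text: a recognised gesture would then act only on the parameter group of the field the hand currently occupies.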
Similar resources
Hand Gesture Recognition from RGB-D Data using 2D and 3D Convolutional Neural Networks: a comparative study
Despite considerable advances in recognizing hand gestures from still images, there are still many challenges in the classification of hand gestures in videos. The latter comes with more challenges, including higher computational complexity and the arduous task of representing temporal features. Hand movement dynamics, represented by temporal features, have to be extracted by analyzing the total fr...
Neural Network Performance Analysis for Real Time Hand Gesture Tracking Based on Hu Moment and Hybrid Features
This paper presents a comparison study between the multilayer perceptron (MLP) and radial basis function (RBF) neural networks with supervised learning and back propagation algorithm to track hand gestures. Both networks have two output classes which are hand and face. Skin is detected by a regional based algorithm in the image, and then networks are applied on video sequences frame by frame in...
Effect of sound classification by neural networks in the recognition of human hearing
In this paper, we focus on two basic issues: (a) the classification of sound by neural networks based on frequency and sound intensity parameters, and (b) evaluating the health of different human ears as compared to those of a healthy person. Sound classification by a specific feed-forward neural network with two inputs, frequency and sound intensity, and two hidden layers is proposed. This process...
An Experimental Set of Hand Gestures for Expressive Control of Musical Parameters in Realtime
This paper describes the implementation of Time Delay Neural Networks (TDNN) to recognize gestures from video images. Video sources are used because they are non-invasive and do not inhibit the performer's physical movement or require specialist devices to be attached to the performer, which experience has shown to be a significant problem that impacts musicians' performance and can focus musical reh...
Pattern Recognition in Control Chart Using Neural Network based on a New Statistical Feature
Today, to expedite the identification and timely correction of process deviations, it is necessary to use advanced techniques that minimize the cost of producing defective products. To this end, control charts, one of the important tools of statistical process control, have been used in combination with modern tools such as artificial neural networks. The artificial neural netw...